Introduction:

This assignment uses two datasets for Colchester: crime23 and temp2023.

The crime23.csv dataset contains street-level crime incidents. Its variables are category, persistent_id, date, lat, long, street_id, street_name, context, id, location_type, location_subtype, and outcome_status. A description of these variables is given in the documentation of the interface used to extract them: https://ukpolice.njtierney.com/reference/ukp_crime.html.

On the other hand, the temp2023.csv dataset comprises daily climate data captured from a weather station in proximity to Colchester. It includes variables such as station_ID, Date, TemperatureCAvg, TemperatureCMax, TemperatureCMin, TdAvgC, HrAvg, WindkmhDir, WindkmhInt, WindkmhGust, PresslevHp, Precmm, TotClOct, lowClOct, SunD1h, VisKm, PreselevHp, and SnowDepcm. We can find a detailed description of these variables and the extraction interface at https://bczernecki.github.io/climate/reference/meteo_ogimet.html.

Throughout this analysis, we will explore the relationships between crime incidents and climatic conditions, uncover patterns, and derive valuable insights to aid decision-making processes. Data cleaning steps, including the removal of columns PreselevHp and SnowDepcm, as well as handling missing values (NA values), will be conducted to ensure the integrity and quality of the analysis.

library(ggplot2) 
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Importing our crime dataset

crime<-read.csv("crime23.csv")
head(crime)
##                category persistent_id    date      lat     long street_id
## 1 anti-social-behaviour               2023-01 51.88306 0.909136   2153366
## 2 anti-social-behaviour               2023-01 51.90124 0.901681   2153173
## 3 anti-social-behaviour               2023-01 51.88907 0.897722   2153077
## 4 anti-social-behaviour               2023-01 51.89122 0.901988   2153186
## 5 anti-social-behaviour               2023-01 51.89416 0.895433   2153012
## 6 anti-social-behaviour               2023-01 51.88050 0.909014   2153379
##                     street_name context        id location_type
## 1      On or near Military Road      NA 107596596         Force
## 2                   On or near       NA 107596646         Force
## 3 On or near Culver Street West      NA 107595950         Force
## 4       On or near Ryegate Road      NA 107595953         Force
## 5       On or near Market Close      NA 107595979         Force
## 6         On or near Lisle Road      NA 107595985         Force
##   location_subtype outcome_status
## 1                            <NA>
## 2                            <NA>
## 3                            <NA>
## 4                            <NA>
## 5                            <NA>
## 6                            <NA>

Plotting the distribution of crime categories

category_plot <- ggplot(crime, aes(x = reorder(category, -table(category)[category]), fill = category)) +
  geom_bar(color = "black", linewidth = 0.5) +  # linewidth replaces the deprecated size argument
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Category of Crime", y = "Frequency", title = "Distribution of Crime Categories") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_text(size = 12),  # Adjusting the size of the axis titles
        plot.title = element_text(size = 16, face = "bold"))  # Adjusting the size and weight of the plot title
#Converting our ggplot into plotly
category_plot_interactive <- ggplotly(category_plot)
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set3 is 12
## Returning the palette you asked for with that many colors
category_plot_interactive

The chart shows how often each crime category occurs. Violent crime is by far the most frequent category, followed by anti-social behaviour, criminal damage and arson, shoplifting, public order, other theft, vehicle crime, bicycle theft, burglary, drugs, robbery, other crime, theft from the person and, last, possession of weapons. Violent crime therefore dominates recorded crime in Colchester, while possession of weapons is the rarest category.
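The ranking read from the chart can be cross-checked numerically. The following is a minimal sketch that assumes only that `category` is a character column; the toy vector is made up for illustration and is not the real data.

```r
# Toy stand-in for crime$category (hypothetical values, not the real data)
category <- c("violent-crime", "violent-crime", "violent-crime",
              "anti-social-behaviour", "anti-social-behaviour",
              "shoplifting")

# Count each category and sort from most to least frequent
category_counts <- sort(table(category), decreasing = TRUE)

# Share of all incidents in each category, as a percentage
category_share <- round(100 * prop.table(category_counts), 1)

print(category_counts)
print(category_share)
```

On the real data, replacing the toy vector with `crime$category` would give the same ordering as the bars in the chart.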

Plotting the outcome status of the crimes :

library(ggplot2)
outcome_status_plot <- ggplot(crime, aes(x = outcome_status, fill = outcome_status)) +
  geom_bar(color = "black", linewidth = 0.5) +
  labs(x = "Outcome Status", y = "Frequency", title = "Outcome Status of Crimes") +
  theme_bw() +  # Change theme to black and white
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_text(size = 12),  
        plot.title = element_text(size = 16, face = "bold")) +  
  scale_fill_manual(values = rainbow(length(unique(crime$outcome_status)))) +  
  guides(fill = guide_legend(title = "Outcome Status"))  
outcome_status_plot_interactive <- ggplotly(outcome_status_plot)
outcome_status_plot_interactive

The chart shows the recorded outcome of each crime. The most common outcome is “Investigation complete; no suspect identified” (2,656 cases), followed by “Unable to prosecute suspect” (1,959). The rarest is “Suspect charged as part of another case”, which occurs only once. Between these extremes lie “Under investigation” (439), “Awaiting court outcome” (260), “Local resolution” (239), “Court result unavailable” (206), “Status update unavailable” (177), “Action to be taken by another organisation” (104), “Further action is not in the public interest” (82), “Offender given a caution” (61), “Formal action is not in the public interest” (9) and “Further investigation is not in the public interest” (8). The exact counts are listed in the table below.

table(crime$outcome_status)
## 
##          Action to be taken by another organisation 
##                                                 104 
##                              Awaiting court outcome 
##                                                 260 
##                            Court result unavailable 
##                                                 206 
##         Formal action is not in the public interest 
##                                                   9 
##        Further action is not in the public interest 
##                                                  82 
## Further investigation is not in the public interest 
##                                                   8 
##       Investigation complete; no suspect identified 
##                                                2656 
##                                    Local resolution 
##                                                 239 
##                            Offender given a caution 
##                                                  61 
##                           Status update unavailable 
##                                                 177 
##             Suspect charged as part of another case 
##                                                   1 
##                         Unable to prosecute suspect 
##                                                1959 
##                                 Under investigation 
##                                                 439
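To put these counts in proportion, the sketch below converts them to percentages. The counts are copied from the `table()` output above; note that `table()` leaves out missing values by default, so records with an `NA` outcome (such as the anti-social behaviour rows shown in the `head(crime)` output) are not part of this total. `table(crime$outcome_status, useNA = "ifany")` would include them.

```r
# Outcome counts copied from the table() output above
outcome_counts <- c(
  "Action to be taken by another organisation" = 104,
  "Awaiting court outcome" = 260,
  "Court result unavailable" = 206,
  "Formal action is not in the public interest" = 9,
  "Further action is not in the public interest" = 82,
  "Further investigation is not in the public interest" = 8,
  "Investigation complete; no suspect identified" = 2656,
  "Local resolution" = 239,
  "Offender given a caution" = 61,
  "Status update unavailable" = 177,
  "Suspect charged as part of another case" = 1,
  "Unable to prosecute suspect" = 1959,
  "Under investigation" = 439
)

# Total number of crimes that have a recorded outcome
total_with_outcome <- sum(outcome_counts)

# Percentage of recorded outcomes in each status, largest first
outcome_share <- sort(round(100 * outcome_counts / total_with_outcome, 1),
                      decreasing = TRUE)

print(total_with_outcome)
print(head(outcome_share, 3))
```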

Plotting crime by the location type:

location_type_plot <- ggplot(crime, aes(x = reorder(location_type, -table(location_type)[location_type]), fill = location_type)) +
  geom_bar(color = "black", alpha = 0.8, width = 0.7) +  
  scale_fill_brewer(palette = "Dark2") +  
  labs(x = "Location Type", y = "Number of Crimes", title = "Crimes by Location Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10, color = "black"),  
        axis.text.y = element_text(size = 10, color = "black"),
        axis.title = element_text(size = 12, color = "black"),
        plot.title = element_text(hjust = 0.5, size = 18, color = "black"),
        legend.position = "right",  # Move legend to the right side
        legend.title = element_text(size = 12, color = "black"),
        legend.text = element_text(size = 10, color = "black"),
        panel.grid.major = element_blank(), 
        panel.border = element_blank(),  
        panel.background = element_rect(fill = "lightgray", color = "black"))
location_type_plot_interactive <- ggplotly(location_type_plot)
location_type_plot_interactive

Almost all crimes have the location type “Force”, meaning they fall under the regular territorial police force, and only a handful are “BTP”, meaning they fall under the jurisdiction of the British Transport Police. Knowing this split can help with resource allocation and strategic planning for crime prevention.

Plotting a two way table for Category and Location Type:

two_way_table <- table(crime$category, crime$location_type)
two_way_table
##                        
##                          BTP Force
##   anti-social-behaviour    0   677
##   bicycle-theft            4   231
##   burglary                 0   225
##   criminal-damage-arson    1   580
##   drugs                    0   208
##   other-crime              0    92
##   other-theft              4   487
##   possession-of-weapons    0    74
##   public-order             6   526
##   robbery                  0    94
##   shoplifting              0   554
##   theft-from-the-person    0    76
##   vehicle-crime            1   405
##   violent-crime            8  2625

This two-way table shows how each crime category is split between British Transport Police (BTP) locations and regular police force (Force) locations. Force locations account for the overwhelming majority of incidents in every category, led by violent crime (2,625), anti-social behaviour (677) and criminal damage and arson (580). BTP locations record only 24 incidents in total: violent crime (8), public order (6), bicycle theft (4), other theft (4), criminal damage and arson (1) and vehicle crime (1), with none in the remaining eight categories. With so few BTP incidents, the table mainly shows that transport-related locations make up a very small share of recorded crime in Colchester.
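The totals and shares quoted for this table can be reproduced with a few lines of base R. This sketch rebuilds the table as a matrix from the printed counts; on the real data the same functions could be applied to `two_way_table` directly.

```r
# Counts copied from the two-way table above
categories <- c("anti-social-behaviour", "bicycle-theft", "burglary",
                "criminal-damage-arson", "drugs", "other-crime",
                "other-theft", "possession-of-weapons", "public-order",
                "robbery", "shoplifting", "theft-from-the-person",
                "vehicle-crime", "violent-crime")
btp   <- c(0, 4, 0, 1, 0, 0, 4, 0, 6, 0, 0, 0, 1, 8)
force <- c(677, 231, 225, 580, 208, 92, 487, 74, 526, 94, 554, 76, 405, 2625)

two_way <- matrix(c(btp, force), ncol = 2,
                  dimnames = list(categories, c("BTP", "Force")))

# Total incidents per location type
location_totals <- colSums(two_way)

# Percentage of each category's incidents recorded at BTP locations
btp_share_pct <- round(100 * two_way[, "BTP"] / rowSums(two_way), 2)

print(location_totals)
print(sort(btp_share_pct, decreasing = TRUE))
```

Looking at shares rather than raw counts gives a slightly different picture: violent crime has the most BTP incidents in absolute terms, but bicycle theft has the largest BTP share of its own category.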

# Converting the two-way table into a long data frame: one row per (category, location type) pair
two_way_df <- as.data.frame(two_way_table)
names(two_way_df) <- c("Category", "LocationType", "Count")

# as.data.frame() on a table is already in long format, so no melting is needed
grouped_bar_plot_interactive <- plot_ly(two_way_df, x = ~Category, y = ~Count, color = ~LocationType, type = "bar") %>%
  layout(title = "Frequency of Crime Categories by Location Type",
         xaxis = list(title = "Category of Crime"),
         yaxis = list(title = "Frequency"),
         barmode = "group")
grouped_bar_plot_interactive
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

The grouped bar chart compares crime categories across the two location types. Read against the two-way table, the “Force” counts are far from uniform: they range from 74 incidents for possession of weapons to 2,625 for violent crime. The “BTP” counts are barely visible at this scale; the largest are violent crime (8) and public order (6), while eight categories, including drugs, have no BTP incidents at all.

Plotting the top 10 crime streets :

library(dplyr)
library(ggplot2) 
library(plotly)

top_10_streets <- crime %>%
  group_by(street_name) %>%
  summarise(total_crimes = n()) %>%
  slice_max(total_crimes, n = 10)  # top 10 streets by count; ties are kept, as with top_n()
crime_top_10 <- crime %>%
  filter(street_name %in% top_10_streets$street_name)

crime_type_analysis_top_10 <- ggplot(crime_top_10, aes(x = street_name, fill = category)) +
  geom_bar(position = "stack") +
  labs(x = "Street Name", y = "Number of Crimes", title = "Crime Type Analysis for Top 10 Streets") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
        legend.position = "bottom",  # Change legend position to bottom
        legend.key.size = unit(0.1, "mm"))  # Adjust legend key size

# Converting our ggplot plot into plotly 
crime_type_analysis_top_10_interactive <- ggplotly(crime_type_analysis_top_10)
crime_type_analysis_top_10_interactive

This chart shows the crime mix for the 10 street labels with the most recorded crimes in Colchester. The label “On or near” with no street name attached has the highest number of crimes; because it carries no street name, it is probably better read as a catch-all than as a single street. Within it, the chart shows anti-social behaviour as the largest category and violent crime as the smallest. “On or near George Street” has the fewest crimes of the ten, and the chart shows the same pattern there, with anti-social behaviour highest and violent crime lowest. The remaining entries are “On or near Balkerne Gardens”, “On or near Church Street”, “On or near Cowdray Avenue”, “On or near Nightclub”, “On or near Parking Area”, “On or near Shopping Area”, “On or near St Nicholas Street” and “On or near Supermarket”.

Converting the crime data date into a date format :

crime$date <- paste0("1-", crime$date)
crime$date <- as.Date(crime$date, format = "%d-%Y-%m")
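The conversion above works because a day is added to each year-month string before parsing. An equivalent sketch, shown on a few hypothetical strings in the same `YYYY-MM` form, appends the day at the end so the default ISO date format applies:

```r
# Hypothetical year-month strings in the same form as crime$date
month_strings <- c("2023-01", "2023-02", "2023-12")

# Append a day so each string becomes a full ISO date, then parse it
month_dates <- as.Date(paste0(month_strings, "-01"))

# Converting back to year-month recovers the original strings
print(month_dates)
print(format(month_dates, "%Y-%m"))
```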

Calculating the total number of crimes on a monthly basis:

crime$date <- as.Date(crime$date)

crime$month <- format(crime$date, "%Y-%m")

# Grouping by month and counting the number of occurrences
monthly_crime <- crime %>%
  group_by(month) %>%
  summarise(total_crimes = n())

# Printing the first few rows of monthly crime
head(monthly_crime)
## # A tibble: 6 × 2
##   month   total_crimes
##   <chr>          <int>
## 1 2023-01          651
## 2 2023-02          467
## 3 2023-03          555
## 4 2023-04          574
## 5 2023-05          586
## 6 2023-06          563

Plotting monthly crime count over time:

monthly_crime$month <- as.Date(paste0(monthly_crime$month, "-01"))

# Plotting a line plot of monthly crime counts over time
monthly_crime_plot <- ggplot(monthly_crime, aes(x = month, y = total_crimes)) +
  geom_line(color = "skyblue", linewidth = 1) +
  labs(x = "Month", y = "Total Crimes", title = "Monthly Crime Counts Over Time") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),  
        panel.grid.major = element_line(color = "gray", linetype = "dotted"), 
        panel.grid.minor = element_blank())  

# Converting ggplot object into plotly
interactive_monthly_crime_plot <- ggplotly(monthly_crime_plot)
interactive_monthly_crime_plot

January has the highest number of crimes (651). The count then falls sharply to 467 in February, rises steadily to 586 in May, and declines unevenly until August. It rises again noticeably in September and then falls until the end of the year. Note that these are monthly counts of recorded crimes in Colchester, not crime rates.
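The size of these month-to-month movements can be quantified. The sketch below uses only the six monthly totals printed earlier (January to June); the remaining months are not shown in the report, so they are left out here rather than guessed.

```r
# Monthly totals copied from the head(monthly_crime) output
monthly_counts <- c("2023-01" = 651, "2023-02" = 467, "2023-03" = 555,
                    "2023-04" = 574, "2023-05" = 586, "2023-06" = 563)

# Absolute change from the previous month
monthly_change <- diff(monthly_counts)

# Percentage change relative to the previous month's total
monthly_pct_change <- round(100 * monthly_change / head(monthly_counts, -1), 1)

print(monthly_change)
print(monthly_pct_change)
```

The January to February drop stands out as much larger than any of the later movements in this half of the year.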

Encoding Categorical Variables into Dummy Variables in Crime Dataset :

crime_encoded <- model.matrix(~ 0 + category, data = crime)
crime_encoded <- as.data.frame(crime_encoded)

Computing the correlation matrix for all the crime categories

correlation_matrix_crime <- cor(crime_encoded)

# Creating a heatmap of the correlation matrix
heatmap_plot <- ggplot(data = reshape2::melt(correlation_matrix_crime)) +
  geom_tile(aes(Var2, Var1, fill = value)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0,
                       limits = c(-1,1), name="Correlation",
                       breaks=seq(-1, 1, by=0.2)) +
  theme_minimal() +
  labs(title = "Correlation Heatmap of Crime Categories") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "right") +
  geom_text(aes(Var2, Var1, label = round(value, 2)), color = "black")
heatmap_plot

As a rule of thumb, a correlation of 1 is treated here as strong, 0.5 to 0.75 as moderate, 0.3 to 0.5 as weak, and below 0.3 as very weak. In the heatmap almost every pair of crime categories has a negative correlation. This is a consequence of the encoding rather than a behavioural finding: each incident belongs to exactly one category, so whenever one dummy variable is 1 all the others are 0. The size of each negative value is driven by how common the two categories are, which is why the most negative value is between violent crime and anti-social behaviour, the two most frequent categories, followed by pairs such as violent crime and shoplifting. Pairs involving less frequent categories, such as violent crime and vehicle crime, are closer to zero. These values should therefore not be read as "if violent crime increases, anti-social behaviour decreases"; testing that would need counts per month or per area rather than per-incident dummy variables.
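One point worth checking is how much of this pattern comes from the dummy encoding itself. Because every incident has exactly one category, the dummy columns are mutually exclusive, and for two such columns with proportions p_a and p_b the correlation is exactly -sqrt(p_a * p_b / ((1 - p_a) * (1 - p_b))). The sketch below confirms this on a toy category vector (hypothetical values, not the crime data).

```r
# Toy category vector: 5 of "a", 3 of "b", 2 of "c" (hypothetical)
category <- rep(c("a", "b", "c"), times = c(5, 3, 2))

# Same dummy encoding as used for the crime categories
dummies <- model.matrix(~ 0 + category)

# Correlation between the "a" and "b" dummy columns
observed <- cor(dummies)["categorya", "categoryb"]

# Value implied by the category proportions alone
p_a <- mean(category == "a")
p_b <- mean(category == "b")
expected <- -sqrt(p_a * p_b / ((1 - p_a) * (1 - p_b)))

print(c(observed = observed, expected = expected))
```

The two numbers agree, so the negative values in a heatmap of this kind reflect category frequencies, not a relationship between crime types.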

Scatter plot of latitude against longitude

ggplot(data = crime, aes(x = lat, y = long)) +
  geom_point() +
  labs(x = "Latitude", y = "Longitude", title = "Crime Locations")

This scatter plot shows where crimes were recorded, using the latitude and longitude coordinates. Most points cluster at roughly 51.88 to 51.89 latitude and 0.89 to 0.91 longitude. This concentration suggests where additional policing and crime-prevention measures in Colchester could be focused.
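Reading a cluster off a scatter plot is approximate, because overlapping points hide how many incidents share a location. One way to check it is to round the coordinates to a coarse grid and count incidents per cell. The coordinates below are made up for illustration; on the real data they would be `crime$lat` and `crime$long`.

```r
# Hypothetical coordinates (not the real data)
lat  <- c(51.8831, 51.8834, 51.8907, 51.9012, 51.8829)
long <- c(0.9091, 0.9094, 0.8977, 0.9017, 0.9088)

# Round to two decimals (cells of roughly 1 km) and label each cell
cell <- paste(round(lat, 2), round(long, 2))

# Count incidents per cell, busiest cell first
cell_counts <- sort(table(cell), decreasing = TRUE)

print(cell_counts)
```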

Plotting a Leaflet / Map for Colchester:

library("dplyr")
library("leaflet")
## Warning: package 'leaflet' was built under R version 4.3.3
crime_map <- leaflet(data = crime) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~long, lat = ~lat,
                   radius = 5, fillOpacity = 0.7, color = "red", stroke = FALSE) %>%
  setView(lng = mean(crime$long), lat = mean(crime$lat), zoom = 10)

crime_map

The same coordinates are now plotted on an interactive Leaflet map. The map shows where in Colchester most crimes are concentrated, giving the police a visual guide to where resources could be deployed to control and limit crime.

Crime Count across different Seasons :

library("ggplot2")
library("plotly")

#Extracting months from date
crime$month <- as.integer(format(as.Date(crime$date), "%m"))
# Breaking our months into seasons now
get_season <- function(month) {
  if (month %in% c(3, 4, 5)) {
    return("Spring")
  } else if (month %in% c(6, 7, 8)) {
    return("Summer")
  } else if (month %in% c(9, 10, 11)) {
    return("Autumn")
  } else {
    return("Winter")
  }
}

crime$season <- sapply(crime$month, get_season)
season_colors <- c("Spring" = "#FFA07A", "Summer" = "#FF6347", "Autumn" = "#FF4500", "Winter" = "#4682B4")
crime_by_season <- ggplot(crime, aes(x = season, fill = season)) +
  geom_bar() +  # Use fill aesthetic for custom colors
  labs(x = "Season", y = "Number of Crimes", title = "Number of Crimes by Season") +
  scale_fill_manual(values = season_colors) +  # Use custom colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
    panel.grid.major.y = element_line(color = "gray", linewidth = 0.2),
    panel.grid.minor = element_blank(),
    axis.line = element_line(color = "black"),
    axis.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10)
  )
# Converting the ggplot object into a plotly object
crime_by_season_interactive <- ggplotly(crime_by_season)
crime_by_season_interactive

Here the months are grouped into seasons to see in which season more crimes are recorded, which could help residents and police in Colchester prepare for higher-crime periods. Autumn has the highest number of crimes, followed by spring, then summer, and winter has the fewest. Note that winter here combines January, February and December 2023, so it is not one continuous season.
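The month-to-season mapping can also be written as a lookup vector, which avoids calling a function once per row. This is only an alternative sketch of the same mapping (December to February as winter, and so on); the `months` vector is a small hypothetical sample.

```r
# Season for each month number 1..12, matching the mapping used in the report
season_lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring",
                   "Summer", "Summer", "Summer", "Autumn", "Autumn",
                   "Autumn", "Winter")

# Hypothetical month numbers, as produced by format(date, "%m")
months <- c(1, 4, 7, 10, 12)

# Index the lookup by month number; a factor fixes the order on the x-axis
seasons <- factor(season_lookup[months],
                  levels = c("Winter", "Spring", "Summer", "Autumn"))

print(seasons)
```

Storing the season as a factor with explicit levels also controls the order of the bars, which otherwise defaults to alphabetical.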

Variation in crime counts across different seasons

crime_heatmap_season <- crime %>%
  group_by(season, category) %>%
  summarise(crime_count = n(), .groups = "drop")  # drop the grouping so the result is a plain tibble
# Now creating a heatmap plot
heatmap_plot_season <- ggplot(crime_heatmap_season, aes(x = season, y = category, fill = crime_count)) +
  geom_tile(color = "white") +  # Add white border to tiles
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Crime Count") +  # Adjust color gradient
  labs(x = "Season", y = "Crime Type", title = "Variation in Crime Counts Across Seasons and Crime Types") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(size = 8),  # Adjust text size for y-axis
        axis.title = element_text(size = 12),  # Adjust title size
        legend.position = "right",  # Move legend to the right
        legend.title = element_text(size = 10),  # Adjust legend title size
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))  # Adjust title properties
heatmap_plot_season

The heatmap shows a similar pattern in every season, with violent crime the most frequent category throughout. Anti-social behaviour is more common in autumn and summer than in spring and winter. Criminal damage and arson, and public order offences, are more common in autumn and spring than in summer and winter. The other categories look broadly similar across seasons.

Now Plotting for Temperature data

temp_data<-read.csv("temp2023.csv")

Removing the PreselevHp and SnowDepcm columns because they contain null (NA) values

temp <- temp_data[, !(names(temp_data) %in% c("PreselevHp", "SnowDepcm"))]
#Converting into date format
temp$Date <- as.Date(temp$Date)
head(temp)
##   station_ID       Date TemperatureCAvg TemperatureCMax TemperatureCMin TdAvgC
## 1       3590 2023-12-31             8.7            10.6             4.4    7.2
## 2       3590 2023-12-30             6.6             9.7             4.4    4.2
## 3       3590 2023-12-29             9.9            11.4             6.9    6.0
## 4       3590 2023-12-28             9.9            11.5             4.0    7.5
## 5       3590 2023-12-27             5.8            10.6             3.9    3.7
## 6       3590 2023-12-26             9.8            12.7             6.3    7.6
##   HrAvg WindkmhDir WindkmhInt WindkmhGust PresslevHp Precmm TotClOct lowClOct
## 1  89.6          S       25.0        63.0      999.0    6.2      8.0      8.0
## 2  85.5        WSW       22.7        50.0     1006.9    0.4      4.6      6.5
## 3  77.2         SW       32.8        61.2     1003.6    0.8      6.5      6.7
## 4  84.6        SSW       32.2        70.4     1003.2    2.8      6.8      7.1
## 5  86.4         SW       13.2        37.1     1016.4    2.0      4.0      6.9
## 6  86.9        WSW       23.5        46.3     1006.2    4.4      6.5      7.4
##   SunD1h VisKm
## 1    0.0  26.3
## 2    1.1  48.3
## 3    0.1  26.7
## 4    0.0  25.1
## 5    3.2  30.1
## 6    0.0  45.8

Converting Daily data into monthly data :

if (!requireNamespace("lubridate", quietly = TRUE)) {
  install.packages("lubridate")
}

library(lubridate)
## Warning: package 'lubridate' was built under R version 4.3.3
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# Converting the date column into a date format
temp$Date <- as.Date(temp$Date, format="%Y-%m-%d")

# Extracting the month from the Date column
temp$Month <- month(temp$Date, label = TRUE, abbr = FALSE)

# Grouping the temperature data on monthly 
monthly_data <- temp %>%
  group_by(Month) %>%
  summarise(
    AvgTemp = mean(TemperatureCAvg, na.rm = TRUE),
    MaxTemp = mean(TemperatureCMax, na.rm = TRUE),
    MinTemp = mean(TemperatureCMin, na.rm = TRUE),
    .groups = 'drop'
  )

print(monthly_data)
## # A tibble: 12 × 4
##    Month     AvgTemp MaxTemp MinTemp
##    <ord>       <dbl>   <dbl>   <dbl>
##  1 January      4.80    8.04    1.31
##  2 February     5.91    9.93    1.42
##  3 March        6.44    9.70    2.35
##  4 April        8      12.5     3.38
##  5 May         11.5    16.2     6.54
##  6 June        16.9    22.5    10.9 
##  7 July        16.7    21.8    11.0 
##  8 August      16.5    21.7    11.3 
##  9 September   17.3    22.4    12.2 
## 10 October     12.7    16.7     8.49
## 11 November     7.11   10.4     3.54
## 12 December     6.86    9.40    3.66

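The daily dataset has now been aggregated to monthly values. Note that MaxTemp and MinTemp are the monthly means of the daily maximum and minimum temperatures, not the monthly extremes.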
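With both datasets at monthly resolution, they can be joined on the month to relate crime counts to temperature. The sketch below is illustrative only: it uses the six monthly crime totals printed earlier and the AvgTemp column above, and the column names `month_number`, `total_crimes` and `avg_temp` are chosen here for the example. Six points are far too few to support a conclusion, so the correlation is computed but not interpreted.

```r
# Monthly crime totals for January to June, copied from the report
monthly_crime_counts <- data.frame(
  month_number = 1:6,
  total_crimes = c(651, 467, 555, 574, 586, 563)
)

# Monthly mean of the daily average temperature, copied from the table above
monthly_avg_temp <- data.frame(
  month_number = 1:12,
  avg_temp = c(4.80, 5.91, 6.44, 8, 11.5, 16.9,
               16.7, 16.5, 17.3, 12.7, 7.11, 6.86)
)

# Inner join on the month number keeps only months present in both tables
crime_temp <- merge(monthly_crime_counts, monthly_avg_temp, by = "month_number")

# Pearson correlation between monthly crime count and average temperature
crime_temp_cor <- cor(crime_temp$total_crimes, crime_temp$avg_temp)

print(crime_temp)
print(crime_temp_cor)
```

On the full data, the same join applied to all twelve months would be the natural starting point for the crime and climate comparison described in the introduction.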

Temperature Range Distribution :

library(plotly)

temp_freq <- table(cut(temp$TemperatureCAvg, breaks = 5))
# as.data.frame() gives one row per temperature bin; rename the columns for readability
temp_df <- as.data.frame(temp_freq)
names(temp_df) <- c("Temperature_Range", "Frequency")

# Creating a pie chart
plot_ly(data = temp_df, labels = ~Temperature_Range, values = ~Frequency, type = "pie") %>%
  layout(title = "Temperature Range Distribution")

The pie chart divides the daily average temperatures into 5 equal-width ranges. The largest slice is 7.68 to 12.8 °C, followed by 12.8 to 18 °C, then 2.54 to 7.68 °C, then 18 to 23.1 °C, and the smallest slice is -2.63 to 2.54 °C.

Density plot :

density_plot_avgtemp<-ggplot(temp, aes(x = TemperatureCAvg)) +
  geom_density(fill = "skyblue", color = "black") +
  labs(x = "Average Temperature (C)", y = "Density", title = "Density Plot of Average Temperature")
density_plot_avgtemp

The density plot has the average temperature on the x-axis and its density on the y-axis. The curve peaks at around 10 degrees Celsius, so daily average temperatures near 10 degrees are the most common in Colchester, while values at the tails of the curve, for example near 25 degrees Celsius, are the rarest.

Temperature Distribution :

temperature_distribution_violin <- ggplot(temp, aes(x = "", y = TemperatureCAvg)) +
  geom_violin(fill = "skyblue") +
  labs(x = "", y = "Average Temperature (C)", title = "Temperature Distribution")

temperature_distribution_violin

The violin plot is widest at around 7.5 to 10 degrees Celsius, in the middle of the average temperature range for Colchester, so most days fall there. Towards both temperature extremes the violin narrows, showing that very cold and very hot days are increasingly rare.

Distribution of Wind Direction :

wind_direction_counts <- table(temp$WindkmhDir)
wind_direction_df <- as.data.frame(wind_direction_counts)
names(wind_direction_df) <- c("Wind Direction", "Frequency")

# Creating a pie chart
pie_chart <- ggplot(wind_direction_df, aes(x = "", y = Frequency, fill = `Wind Direction`)) +
  geom_bar(stat = "identity") +
  coord_polar("y") +  
  labs(fill = "Wind Direction", title = "Distribution of Wind Directions") + 
  theme_minimal()  
pie_chart

The most frequent wind direction is WSW (58 days), then SW (49 days), then SSW (41 days). The least frequent is SE, recorded on only 6 days. By meteorological convention these are the directions the wind blows from, so south-westerly winds dominate.

table(temp$WindkmhDir)
## 
##   E ENE ESE   N  NE NNE NNW  NW   S  SE SSE SSW  SW   W WNW WSW 
##  10  20  11  18  22  15  13  15  26   6  14  41  49  34  13  58

Plotting correlation between Wind Speed and Wind Gust :

correlation_plot <- ggplot(temp, aes(x = WindkmhInt, y = WindkmhGust)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Wind Speed (km/h) - Int", y = "Wind Speed (km/h) - Gust", title = "Correlation between Wind Speed (Int) and Wind Speed (Gust)")

correlation_plot
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows a clear positive linear relationship between average wind speed (WindkmhInt) and wind gust (WindkmhGust): as one increases, the other tends to increase as well. The points lie close to the fitted straight line, which supports describing the relationship as linear.
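The visual impression can be backed by two numbers: the Pearson correlation and the slope of the fitted line that `geom_smooth(method = "lm")` draws. The values below are hypothetical and serve only to show the calls; on the real data the vectors would be `temp$WindkmhInt` and `temp$WindkmhGust`.

```r
# Hypothetical wind speeds and gusts in km/h (not the real data)
wind_speed <- c(10, 15, 20, 25, 30)
wind_gust  <- c(28, 36, 47, 55, 66)

# Strength of the linear relationship
wind_cor <- cor(wind_speed, wind_gust)

# Straight-line fit of gust on speed, as drawn by method = "lm"
wind_fit <- lm(wind_gust ~ wind_speed)

print(wind_cor)
print(coef(wind_fit))
```

The slope says how many km/h the gust rises, on average, for each extra km/h of average wind speed.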

Boxplot of Precipitation Distribution by Temperature Range:

temperature_precipitation_boxplot <- ggplot(temp, aes(x = cut(TemperatureCAvg, breaks = 5), y = Precmm, fill = cut(TemperatureCAvg, breaks = 5))) +
  geom_boxplot(color = "black") +
  scale_fill_brewer(palette = "Set3", name = "Temperature Range") +  
  labs(x = "Temperature Range", y = "Precipitation (mm)", title = "Precipitation Distribution by Temperature Range") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Converting the ggplot object into an interactive plotly 
interactive_temperature_precipitation_boxplot <- ggplotly(temperature_precipitation_boxplot)
## Warning: Removed 27 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
interactive_temperature_precipitation_boxplot

Every temperature range contains outliers. The highest precipitation values occur in the 12.8 to 18 range, while the extreme ranges near -2.63 and 23.1 show little precipitation. Precipitation is water, liquid or frozen, that falls from the atmosphere to the ground; in this data, heavier falls are more common on mild days than on the coldest or hottest days. The 27 rows dropped in the warning are days without a usable precipitation value.
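The outlier points in a boxplot follow a fixed rule: a value is drawn as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule by hand to a hypothetical precipitation vector that includes one missing value, mirroring the rows dropped in the warning above.

```r
# Hypothetical daily precipitation in mm, with one missing value
precip <- c(0, 0, 0.2, 0.4, 1, 2, 15, NA)

# Number of missing values that a boxplot would drop
n_missing <- sum(is.na(precip))

# Quartiles and interquartile range, ignoring missing values
quartiles <- quantile(precip, c(0.25, 0.75), na.rm = TRUE)
iqr <- unname(diff(quartiles))

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- unname(quartiles[1]) - 1.5 * iqr
upper_fence <- unname(quartiles[2]) + 1.5 * iqr

# Values outside the fences are the points drawn as outliers
outliers <- precip[!is.na(precip) & (precip < lower_fence | precip > upper_fence)]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

Because most days have little or no rain, the quartiles sit close to zero and even moderately wet days end up beyond the upper fence, which is why every temperature range shows outliers.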

Creating a time series plot for our temperature dataset :

temperature_plot <- ggplot(temp, aes(x = Date)) +
  geom_line(aes(y = TemperatureCAvg, color = "Average Temperature")) +
  geom_line(aes(y = TemperatureCMax, color = "Maximum Temperature")) +
  geom_line(aes(y = TemperatureCMin, color = "Minimum Temperature")) +
  labs(x = "Date", y = "Temperature (C)", color = "Temperature") +
  ggtitle("Temperature Over Time") +
  theme_minimal()
# Converting the ggplot object to an interactive Plot
interactive_temperature_plot <- ggplotly(temperature_plot)
interactive_temperature_plot

The graph above shows temperature in degrees Celsius on the y-axis and the date on the x-axis, with three lines: red for the “Average Temperature”, green for the “Maximum Temperature” and blue for the “Minimum Temperature”. The average temperature is higher in the summer months (around July 2023) and lower in the winter months (around January in both 2023 and 2024). The maximum temperature follows the same trend at a higher level; in July 2023, for example, it sits above the average temperature. The minimum temperature also follows the same trend, but at a lower level, with a more pronounced drop than the average temperature in January of both 2023 and 2024.
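These readings of the chart can be checked directly on the data. The sketch below is an illustration only, assuming the six January rows printed in the merged output of this report as stand-in data; the data frame name temps and the helper column range are hypothetical.

```r
# Stand-in data: six daily rows printed in this report's merged output
# (the data frame name `temps` is hypothetical).
temps <- data.frame(
  Date            = as.Date(c("2023-01-02", "2023-01-01", "2023-01-28",
                              "2023-01-27", "2023-01-26", "2023-01-25")),
  TemperatureCAvg = c(8.4, 10.4, 3.0, 4.7, 2.4, 2.3),
  TemperatureCMax = c(13.1, 13.1, 7.1, 8.1, 5.1, 3.9),
  TemperatureCMin = c(5.0, 9.1, 0.2, 2.2, -0.3, 0.3)
)

# Daily spread between the maximum and minimum lines.
temps$range <- temps$TemperatureCMax - temps$TemperatureCMin

# Dates of the warmest and coldest days by average temperature.
warmest <- temps$Date[which.max(temps$TemperatureCAvg)]
coldest <- temps$Date[which.min(temps$TemperatureCAvg)]

# The average should lie between the minimum and maximum on every day.
ordered_ok <- all(temps$TemperatureCMin <= temps$TemperatureCAvg &
                  temps$TemperatureCAvg <= temps$TemperatureCMax)

print(warmest)
print(coldest)
print(ordered_ok)
```

Run on the full temp data frame, the same lines would give the dates of the summer peak and the winter low that the chart suggests.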

Distribution of Temperature Over Time:

distribution_temperature <- ggplot(temp, aes(x = Date, y = TemperatureCAvg)) +
  geom_smooth(color = "skyblue", fill = "lightblue", method = "loess") +
  labs(x = "Date", y = "Average Temperature (C)", title = "Distribution of Temperature Over Time") +
  theme_minimal()
interactive_distribution_temperature <- ggplotly(distribution_temperature)
## `geom_smooth()` using formula = 'y ~ x'
interactive_distribution_temperature

The graph above shows the date on the x-axis and the “Average Temperature” on the y-axis. The smoothed (loess) curve peaks at an average temperature of about 17.8 around August 2023. Smoothing also produces a shaded band around the curve, which gives an interval for the estimate: at the peak, the upper bound is 18.39 and the lower bound is 17.19.
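To show where such a band comes from, the sketch below fits a loess curve by hand and derives a 95% confidence band from its standard errors. It is an illustration under stated assumptions: the temperatures are synthetic (a made-up seasonal curve plus noise, not values from temp2023.csv), all variable names are hypothetical, and the band drawn by geom_smooth() may differ slightly because ggplot2 evaluates the fit on its own grid.

```r
# Synthetic daily temperatures for one year (made up for illustration,
# not taken from temp2023.csv): a seasonal curve plus random noise.
set.seed(1)
day <- 1:365
avg_temp <- 10 + 8 * sin(2 * pi * (day - 110) / 365) + rnorm(365, sd = 2)

# Fit a loess curve with R's default settings.
fit <- loess(avg_temp ~ day)

# Fitted values with standard errors, then a 95% confidence band.
pred <- predict(fit, newdata = data.frame(day = day), se = TRUE)
smoothed <- unname(pred$fit)
margin <- qt(0.975, pred$df) * unname(pred$se.fit)
lower <- smoothed - margin
upper <- smoothed + margin

# Peak of the smoothed curve and the band around it.
peak <- which.max(smoothed)
print(c(day = day[peak], lower = lower[peak],
        fit = smoothed[peak], upper = upper[peak]))
```

The lower and upper values at the peak play the same role as the 17.19 and 18.39 bounds read from the interactive plot.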

Merging the temperature and crime datasets:

temp_d <- read.csv("temp2023.csv")
crime_d <- read.csv("crime23.csv")

# Converting the temperature Date column into Date format:
temp_d$date <- as.Date(temp_d$Date)

# Reducing the dates to "YYYY-MM" format to match the monthly crime dates:
temp_d$date <- format(temp_d$date, "%Y-%m")

# Merging crime and temperature data by month; each crime record is paired
# with every daily weather record from the same month:

combined_d <- merge(crime_d, temp_d, by = "date")

head(combined_d)
##      date              category persistent_id      lat     long street_id
## 1 2023-01 anti-social-behaviour               51.88306 0.909136   2153366
## 2 2023-01 anti-social-behaviour               51.88306 0.909136   2153366
## 3 2023-01 anti-social-behaviour               51.88306 0.909136   2153366
## 4 2023-01 anti-social-behaviour               51.88306 0.909136   2153366
## 5 2023-01 anti-social-behaviour               51.88306 0.909136   2153366
## 6 2023-01 anti-social-behaviour               51.88306 0.909136   2153366
##                street_name context        id location_type location_subtype
## 1 On or near Military Road      NA 107596596         Force                 
## 2 On or near Military Road      NA 107596596         Force                 
## 3 On or near Military Road      NA 107596596         Force                 
## 4 On or near Military Road      NA 107596596         Force                 
## 5 On or near Military Road      NA 107596596         Force                 
## 6 On or near Military Road      NA 107596596         Force                 
##   outcome_status station_ID       Date TemperatureCAvg TemperatureCMax
## 1           <NA>       3590 2023-01-02             8.4            13.1
## 2           <NA>       3590 2023-01-01            10.4            13.1
## 3           <NA>       3590 2023-01-28             3.0             7.1
## 4           <NA>       3590 2023-01-27             4.7             8.1
## 5           <NA>       3590 2023-01-26             2.4             5.1
## 6           <NA>       3590 2023-01-25             2.3             3.9
##   TemperatureCMin TdAvgC HrAvg WindkmhDir WindkmhInt WindkmhGust PresslevHp
## 1             5.0    6.4  88.4        SSW       15.0        50.0     1009.0
## 2             9.1    7.0  79.5         SW       27.0        53.7     1004.0
## 3             0.2    0.8  86.0        NNE       11.5        31.5     1030.9
## 4             2.2    3.5  91.7          N       20.2        44.5     1029.8
## 5            -0.3    2.2  97.9        WNW       16.1        35.2     1029.6
## 6             0.3    0.1  85.5         NE        6.9        20.4     1038.4
##   Precmm TotClOct lowClOct SunD1h VisKm PreselevHp SnowDepcm
## 1    3.0      6.1      6.7     NA  16.4         NA        NA
## 2    8.2      4.4      6.2     NA  37.9         NA        NA
## 3    0.0      6.2      6.8     NA  32.8         NA        NA
## 4    0.0      8.0      8.0     NA  20.0         NA        NA
## 5    3.2      7.1      7.7     NA   5.4         NA        NA
## 6    0.0      8.0      8.0     NA  20.2         NA        NA
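In the output above, the same crime record (id 107596596) appears once for every day of January, because the crime dates are monthly while the weather data is daily. If one row per crime is preferred, the weather can be averaged per month before merging. The sketch below is one possible approach, not the method used in this report; it uses small stand-in data frames, and the second crime row and the name monthly_temp are hypothetical.

```r
# Stand-in crime data: the first row mirrors the record shown above; the
# second row (category and id) is hypothetical.
crime_d <- data.frame(
  date     = c("2023-01", "2023-01"),
  category = c("anti-social-behaviour", "burglary"),
  id       = c(107596596, 1)
)

# Stand-in weather data: the six daily rows shown above.
temp_d <- data.frame(
  Date            = c("2023-01-02", "2023-01-01", "2023-01-28",
                      "2023-01-27", "2023-01-26", "2023-01-25"),
  TemperatureCAvg = c(8.4, 10.4, 3.0, 4.7, 2.4, 2.3),
  HrAvg           = c(88.4, 79.5, 86.0, 91.7, 97.9, 85.5)
)
temp_d$date <- format(as.Date(temp_d$Date), "%Y-%m")

# Merging on the month directly pairs every crime with every day.
daily_merge <- merge(crime_d, temp_d, by = "date")

# Averaging the weather per month first keeps one row per crime.
monthly_temp <- aggregate(cbind(TemperatureCAvg, HrAvg) ~ date,
                          data = temp_d, FUN = mean)
monthly_merge <- merge(crime_d, monthly_temp, by = "date")

print(nrow(daily_merge))    # 12: 2 crimes x 6 days
print(nrow(monthly_merge))  # 2: one row per crime
```

With the monthly version, counts and averages computed per crime are no longer inflated by the number of days in the month.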

Getting the column names of our combined dataset:

colnames(combined_d)
##  [1] "date"             "category"         "persistent_id"    "lat"             
##  [5] "long"             "street_id"        "street_name"      "context"         
##  [9] "id"               "location_type"    "location_subtype" "outcome_status"  
## [13] "station_ID"       "Date"             "TemperatureCAvg"  "TemperatureCMax" 
## [17] "TemperatureCMin"  "TdAvgC"           "HrAvg"            "WindkmhDir"      
## [21] "WindkmhInt"       "WindkmhGust"      "PresslevHp"       "Precmm"          
## [25] "TotClOct"         "lowClOct"         "SunD1h"           "VisKm"           
## [29] "PreselevHp"       "SnowDepcm"
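The column list still contains PreselevHp and SnowDepcm, which the introduction names as columns to remove, because combined_d is built from the freshly read CSV files. A minimal sketch of dropping them again is given below; it uses a small stand-in for combined_d with only four of its columns, and the names drop_cols, cleaned_d and na_share are hypothetical.

```r
# Stand-in for combined_d with four of its columns; the values come from
# the rows printed earlier in this report.
combined_d <- data.frame(
  date       = c("2023-01", "2023-01"),
  HrAvg      = c(88.4, 79.5),
  PreselevHp = c(NA, NA),
  SnowDepcm  = c(NA, NA)
)

# Drop the two columns by name.
drop_cols <- c("PreselevHp", "SnowDepcm")
cleaned_d <- combined_d[, !(names(combined_d) %in% drop_cols), drop = FALSE]

# Share of missing values per remaining column, as a quick check.
na_share <- colMeans(is.na(cleaned_d))

print(colnames(cleaned_d))
print(na_share)
```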

Plotting average humidity for the top 10 streets:

# Calculating average humidity
avg_humidity <- combined_d %>%
  group_by(street_name) %>%
  summarise(avg_humidity = mean(HrAvg, na.rm = TRUE)) %>%
  arrange(desc(avg_humidity))

ggplot_object <- ggplot(avg_humidity[1:10, ], aes(x = reorder(street_name, avg_humidity), y = avg_humidity)) +
  geom_bar(stat = "identity", fill = "#4CAF50", color = "black", alpha = 0.8) +  
  labs(x = "Street Name", y = "Average Humidity (%)", title = "Average Humidity for Top 10 Streets") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10, color = "black"),  
        axis.text.y = element_text(size = 10, color = "black"),  
        axis.title = element_text(size = 12, color = "black"),  
        plot.title = element_text(hjust = 0.5, size = 16, color = "black"),  
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank(),  
        panel.border = element_blank(),  
        legend.position = "none")  

plotly_object <- ggplotly(ggplot_object)
plotly_object

This graph shows the ten streets in Colchester with the highest average humidity. The highest value, 88.9, is for “on or near Ireton Road”, and the lowest of the ten, 87.19, is for “on or near Colne Bank Avenue”. In between are “on or near Highfield”, “on or near Norwich Close”, “on or near St Augustine Mews”, “on or near Brookside”, “on or near Carlisle”, “on or near Charles Street”, “on or near Bristol Road” and “on or near Christ Church Court”. Because humidity is recorded at a single weather station, these differences reflect when crimes were recorded on each street rather than street-level humidity.
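Since a street with only a handful of records can reach the top of such a ranking by chance, it can help to report the number of records next to each average. The sketch below is an illustration only: the street names appear in this report, but the humidity values and row counts are made up, and the names records, street_summary and top are hypothetical.

```r
# Made-up records for illustration; the street names appear in this report,
# but the humidity values and row counts are hypothetical.
records <- data.frame(
  street_name = c("On or near Ireton Road", "On or near Ireton Road",
                  "On or near Colne Bank Avenue", "On or near Colne Bank Avenue",
                  "On or near Colne Bank Avenue", "On or near Military Road"),
  HrAvg = c(90, 88, 86, 88, 87, 80)
)

# Mean humidity and number of records per street.
avg <- tapply(records$HrAvg, records$street_name, mean, na.rm = TRUE)
n <- table(records$street_name)
street_summary <- data.frame(
  street_name  = names(avg),
  avg_humidity = as.numeric(avg),
  n_records    = as.integer(n[names(avg)])
)

# Rank by average humidity and keep the top two streets.
street_summary <- street_summary[order(-street_summary$avg_humidity), ]
top <- head(street_summary, 2)

print(top)
```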

Conclusion

In summary, the analysis of Colchester’s policing dataset for 2023, alongside daily climate data, offers valuable insights into crime patterns and their potential correlations with weather conditions. Violent crime was the highest crime category throughout the year in Colchester. The most common outcome was “Investigation complete; no suspect identified”, which suggests that a large share of offences go unpunished and may encourage further offending. The highest number of crimes was recorded in January, and we were able to highlight the streets and times at which most crime occurs, indicating where the police could be more vigilant to help prevent crime in Colchester. Turning to the temperature dataset, the maximum and minimum temperatures follow the same trend as the average temperature, with higher temperatures in summer and lower temperatures in winter. We also found a positive linear correlation between wind speed and wind gust: as one increases, the other tends to increase.

References

  1. https://ukpolice.njtierney.com/reference/ukp_crime.html
  2. https://bczernecki.github.io/climate/reference/meteo_ogimet.html
  3. https://www.rdocumentation.org/packages/ggplot2/versions/3.5.0
  4. https://www.rdocumentation.org/packages/plotly/versions/4.10.4